[SPARK-20214][ML] Make sure converted csc matrix has sorted indices#17532
[SPARK-20214][ML] Make sure converted csc matrix has sorted indices#17532viirya wants to merge 2 commits intoapache:masterfrom
Conversation
f88eb6b to
10be94a
Compare
|
Test build #75528 has finished for PR 17532 at commit
|
|
Test build #75530 has finished for PR 17532 at commit
|
|
I don't think is directly related, but, I tried to make a similar change to the Spark Scala implementation of this type of conversion and couldn't make it work, just because it changed a bunch of little behaviors. (I could look up the JIRA.) This could be fine, being more narrow. |
| assert l.shape[1] == 1, "Expected column vector" | ||
| csc = l.tocsc() | ||
| # Make sure the converted csc_matrix has sorted indices. | ||
| csc = l.tocsr().tocsc() |
There was a problem hiding this comment.
I'd just call tocsc. You can then check the attribute has_sorted_indices and call sort_indices() if needed.
|
Are you able to write a unit test which passes data through _convert_to_vector and fails before this fix? |
|
Btw, I'd really like to get this into 2.2, which will be cut soon. Let me know if you'd like me to take it over. Thanks! |
|
@jkbradley Thanks for comment. I will add the unit test now. |
|
@jkbradley An unit test is added. Please check this again. Thanks. |
| self.assertEqual(sv, serialize(lil.tocsr())) | ||
| self.assertEqual(sv, serialize(lil.todok())) | ||
|
|
||
| def test_convert_to_vector(self): |
There was a problem hiding this comment.
We have no Scipy-related tests in ml submodule, so I don't add it. If we really need one, please let me know. I can add it quickly.
There was a problem hiding this comment.
No it's Ok; I made a separate JIRA to port the tests there.
|
Test build #75560 has finished for PR 17532 at commit
|
|
LGTM |
## What changes were proposed in this pull request?
`_convert_to_vector` converts a scipy sparse matrix to csc matrix for initializing `SparseVector`. However, it doesn't guarantee the converted csc matrix has sorted indices and so a failure happens when you do something like that:
from scipy.sparse import lil_matrix
lil = lil_matrix((4, 1))
lil[1, 0] = 1
lil[3, 0] = 2
_convert_to_vector(lil.todok())
File "/home/jenkins/workspace/python/pyspark/mllib/linalg/__init__.py", line 78, in _convert_to_vector
return SparseVector(l.shape[0], csc.indices, csc.data)
File "/home/jenkins/workspace/python/pyspark/mllib/linalg/__init__.py", line 556, in __init__
% (self.indices[i], self.indices[i + 1]))
TypeError: Indices 3 and 1 are not strictly increasing
A simple test can confirm that `dok_matrix.tocsc()` won't guarantee sorted indices:
>>> from scipy.sparse import lil_matrix
>>> lil = lil_matrix((4, 1))
>>> lil[1, 0] = 1
>>> lil[3, 0] = 2
>>> dok = lil.todok()
>>> csc = dok.tocsc()
>>> csc.has_sorted_indices
0
>>> csc.indices
array([3, 1], dtype=int32)
I checked the source codes of scipy. The only way to guarantee it is `csc_matrix.tocsr()` and `csr_matrix.tocsc()`.
## How was this patch tested?
Existing tests.
Please review http://spark.apache.org/contributing.html before opening a pull request.
Author: Liang-Chi Hsieh <viirya@gmail.com>
Closes #17532 from viirya/make-sure-sorted-indices.
(cherry picked from commit 1220605)
Signed-off-by: Joseph K. Bradley <joseph@databricks.com>
## What changes were proposed in this pull request?
`_convert_to_vector` converts a scipy sparse matrix to csc matrix for initializing `SparseVector`. However, it doesn't guarantee the converted csc matrix has sorted indices and so a failure happens when you do something like that:
from scipy.sparse import lil_matrix
lil = lil_matrix((4, 1))
lil[1, 0] = 1
lil[3, 0] = 2
_convert_to_vector(lil.todok())
File "/home/jenkins/workspace/python/pyspark/mllib/linalg/__init__.py", line 78, in _convert_to_vector
return SparseVector(l.shape[0], csc.indices, csc.data)
File "/home/jenkins/workspace/python/pyspark/mllib/linalg/__init__.py", line 556, in __init__
% (self.indices[i], self.indices[i + 1]))
TypeError: Indices 3 and 1 are not strictly increasing
A simple test can confirm that `dok_matrix.tocsc()` won't guarantee sorted indices:
>>> from scipy.sparse import lil_matrix
>>> lil = lil_matrix((4, 1))
>>> lil[1, 0] = 1
>>> lil[3, 0] = 2
>>> dok = lil.todok()
>>> csc = dok.tocsc()
>>> csc.has_sorted_indices
0
>>> csc.indices
array([3, 1], dtype=int32)
I checked the source codes of scipy. The only way to guarantee it is `csc_matrix.tocsr()` and `csr_matrix.tocsc()`.
## How was this patch tested?
Existing tests.
Please review http://spark.apache.org/contributing.html before opening a pull request.
Author: Liang-Chi Hsieh <viirya@gmail.com>
Closes #17532 from viirya/make-sure-sorted-indices.
(cherry picked from commit 1220605)
Signed-off-by: Joseph K. Bradley <joseph@databricks.com>
What changes were proposed in this pull request?
_convert_to_vectorconverts a scipy sparse matrix to csc matrix for initializingSparseVector. However, it doesn't guarantee the converted csc matrix has sorted indices and so a failure happens when you do something like that:A simple test can confirm that
dok_matrix.tocsc()won't guarantee sorted indices:I checked the source codes of scipy. The only way to guarantee it is
csc_matrix.tocsr()andcsr_matrix.tocsc().How was this patch tested?
Existing tests.
Please review http://spark.apache.org/contributing.html before opening a pull request.